Semi-automated annotation of page-based documents within the Genre and Multimodality framework
Abstract
This paper describes ongoing work on a tool developed for annotating document images for their multimodal features and compiling this information into a corpus. The tool leverages open source computer vision and natural language processing libraries to describe the content and structure of multimodal documents and to generate multiple layers of XML annotation. The paper introduces the annotation schema, describes the document processing pipeline and concludes with a brief description of future work.
Similar resources
The Impact of Noise in Web Genre Identification
Genre detection of web documents fits an open-set classification task. Web documents that do not belong to any predefined genre, or in which multiple genres co-exist, are considered noise. In this work we study the impact of noise on automated genre identification within an open-set classification framework. We examine alternative classification models and document representation schemes based on t...
Domain Specific Language in Technical Solution Documents - Discussion of Two Approaches to Improve the Semi-automated Annotation
The efficient search for existing solutions in mechanical engineering is a key-factor for successful product development. Ontology-based knowledge systems can support the semi-automated annotation of documents about existing solutions and enable the retrieval of those documents. However, the use of different wordings for similar products and a generally heterogeneous domain-specific language hi...
Ontea: Platform for Pattern Based Automated Semantic Annotation
Automated annotation of web documents is a key challenge of the Semantic Web effort. Semantic metadata can be created manually or using automated annotation or tagging tools. The automated semantic annotation tools with the best results are built on various machine learning algorithms, which require training sets. Another approach is to use pattern-based semantic annotation solutions built on natural lang...
Towards Automatic Web Genre Identification: A Corpus-Based Approach in the Domain of Academia by Example of the Academic's Personal Homepage
We argue for a systematic analysis of one particular, well structured domain—academic Web pages—with regard to a special class of digital genres: Web genres. For this purpose, we have developed a database-driven system that will ultimately consist of more than 3 000 000 HTML documents, written in German, which are the empirical basis for our research. We introduce the notions of Web genre type ...
IJEL 4/1 page layout
Instructional text, and procedural text in particular, is a genre that users heavily rely upon when they are learning new procedures, devices or systems. It is, however, also well-known to be a genre that is difficult to produce and maintain. This article discusses Isolde, an environment that attempts to address this problem by supporting the semi-automated production of procedural instructions...
Publication date: 2016